June 6, 2017
(taken from public description of UBC MDS course DSCI 561)
Linear models for a quantitative response variable, with multiple categorical and/or quantitative predictors. Matrix formulation of linear regression. Model assessment and prediction.
(taken from public description of UBC MDS course DSCI 561)
By the end of the course, students are expected to be able to:
(adapted from public description of UBC MDS course DSCI 561)
1/4 of the way into the course. Before this lesson, students will have learned the following:
By the end of this lesson students are expected to be able to:
from Data USA: https://datausa.io/
| display_name | income | mean_commute_minutes | median_property_value | non_eng_speakers_pct | owner_occupied_housing_units | pop |
|---|---|---|---|---|---|---|
| Texas | 53207 | 24.5090 | 136000 | 0.3503480 | 0.622325 | 26538614 |
| Pennsylvania | 53599 | 25.2695 | 166000 | 0.1064820 | 0.692052 | 12779559 |
| South Carolina | 45483 | 23.0552 | 139900 | 0.0686766 | 0.685914 | 4777576 |
| New Hampshire | 66779 | 25.2973 | 237300 | 0.0786821 | 0.709609 | 1324201 |
| Kansas | 52205 | 18.2779 | 132000 | 0.1130530 | 0.666891 | 2892987 |
| Hawaii | 69515 | 25.5813 | 515300 | 0.2521790 | 0.569030 | 1406299 |
mathematical forumula: \(Y = \beta_{0} + \beta_{1}X_{1} + \varepsilon\)
model forumula in R: Y ~ X1
\(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{3} + \varepsilon\)
\(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{1}X_{2} + \varepsilon\)
model forumula in R:
Y ~ X1 + X2 + X3
Y ~ X1 * X2
http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf
\(\textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value} = \beta_{0} + \beta_{1}\textrm{income} + \varepsilon\)
prop_model <- lm(median_property_value ~ income, data = prop_data)
create linear model object:
prop_model <- lm(median_property_value ~ income, data = prop_data)
print/view model object output:
summary(prop_model)
## ## Call: ## lm(formula = median_property_value ~ income, data = prop_data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -65249 -36542 -6990 8003 219312 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.503e+05 4.393e+04 -3.422 0.00125 ** ## income 6.420e+00 7.999e-01 8.026 1.52e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 58860 on 50 degrees of freedom ## Multiple R-squared: 0.563, Adjusted R-squared: 0.5542 ## F-statistic: 64.41 on 1 and 50 DF, p-value: 1.517e-10
summary():| model output | code |
|---|---|
| parameter/coefficient estimates | model_object$coefficients |
| residuals | model_object$residuals |
| predicted/fitted values | model_object$fitted.values |
| p-values for coefficients | summary(model_object)$coefficients[,4] |
| \(\sigma\) estimate | summary(model_object)$sigma |
| \(R^{2}\) | summary(model_object)$r.squared |
Working with model output in base R,
the good, the bad and the ugly…
broom:
A better way for working with model output in R
broom for working with model output in R:broom function |
model output |
|---|---|
tidy(model_object) | model parameters (e.g., coefficients and p-values) | |augment(model_object)| model data (e.g., residuals and predicted values) | |glance(model_object)` |
model quality, complexity and summaries (e.g., \(\sigma\) estimate and \(R^{2}\)) |